Abstract. Recommender systems influence our decisions on the Internet every day.
While considerable attention has been given to issues such as recommendation accuracy 
and user privacy, the long-term mutual feedback between a recommender system and 
the decisions of its users has been neglected so far. We propose here a model of network 
evolution which allows us to study the complex dynamics induced by this feedback, 
including the hysteresis effect which is typical for systems with non-linear dynamics. 
Despite the popular belief that recommendation helps users to discover new things, we 
find that the long-term use of recommendation can contribute to the rise of extremely 
popular items and thus ultimately narrow the user choice. These results are supported 
by measurements of the time evolution of item popularity inequality in real systems. 
We show that this adverse effect of recommendation can be tamed by sacrificing part 
of short-term recommendation accuracy. 

1. Introduction

Even if we do not notice it, our life on the Internet is influenced by recommendations. 
Popular web sites such as Amazon, Netflix and YouTube attempt to facilitate 
our navigation by suggesting new, possibly relevant items to us and thus increase our
satisfaction and their profits PEiia]. Employed recommendation algorithms range
from simple variants of “buyers who choose item A also choose item B” n El E] in 
Amazon to more sophisticated techniques such as the singular value decomposition [7].
Even though many users still act independently of any automated assistance, the use of 
recommendation is on the rise. For example, the DVD rental company Netflix estimated
that 75% of the rental choices of their users come from some form of recommendation [8]. 

The rationale behind recommendation is to match the right customers with the 
right products. This task is particularly important and difficult for less popular items,
for which user patterns cannot be easily identified. Correct matching of less popular
items is crucial for e-commerce: studies have shown that 20% to 40% of Amazon’s
sales do not come from the best-selling items [9]. It has been suggested that if one ranks
items according to their sales and thus constructs a so-called popularity-rank curve (see
an example in Fig. 1a), there is a long tail which comprises a large number of niche
items mm- These niche items enjoy a higher profit margin compared to the small
profit margin determined by a more competitive market of popular items, and can even
boost the sales of other items by providing a convenient one-stop outlet to users mm- 
In this respect, recommendation algorithms seem to be the best candidate to explore 
the profit that hides in the long tail.

Recommendation working as intended thus contributes to (1) increasing the
diversity of recommended items and (2) distributing user attention more evenly among the
items. In this case, recommendation would cause the long tail in Fig. 1a to gain more
weight with time. However, Figs. 1b-d demonstrate that the opposite is found in reality.
Despite recommendation algorithms implemented at Netflix, Amazon, and Movielens
(a web site for movie recommendation), the tail of the popularity distribution becomes
shorter with time (see also the evolution of the distribution in Fig. S1). Simultaneously,
the most popular items account for an increasing share of total sales. It is an adverse
effect if some items become too dominant, similar to the emergence of over-dominant
species in an ecosystem, which leads to the reduction of biodiversity and ultimately may
lead to the loss of equilibrium in the system. Evaluation of recommendation exclusively
on the basis of accuracy-oriented metrics cannot, by its nature, capture and explain this 
long-term behavior. When recommendation is iterated for a small number of rounds 
as in PI, the possible feedback between user choice and the recommender system is 
detected. These interactions between users and the system affect its macroscopic
properties and trends, as in physical systems. A more physical approach is
thus needed to gain the first insights into the puzzle posed by Fig. 1.

To understand the long-term impact of recommender systems, one has to study 
the co-evolution of the recommendation and the online user-item network. The 
recommendation results affect the growth of the network, and changes in the network
in turn influence future recommendation outcomes. The effect is amplified with
successive recommendations. In this paper, we investigate this issue by repeatedly
applying recommendation on the user-item network. Our focus is different from
currently known recommendation studies which aim at short-term metrics such as
accuracy [14], diversity [15] and others [16]. We demonstrate that the repeated use
of usual recommendation algorithms makes the system reach a stationary state where
user attention is concentrated on a few items instead of distributed over a broad range
of items, which is in agreement with the data presented in Fig. 1. In other words,
usual recommendation algorithms ultimately narrow user choice and reduce information
horizons instead of widening them. We also observe a hysteresis phenomenon which
implies that while recommendation naturally gives rise to hugely popular items, fully
reverting this change is not possible because the uneven distribution of user attention
is robust over a broad range of the recommendation algorithm’s parameters. Our
observations directly challenge the role of recommendation for online retailers, as well as
other applications of recommendation in search engines [17], online social networks [18],
news media [19], and even suggestions of research papers [20]. We finally show that
before some items become too dominant, it is possible to make a compromise between
recommendation accuracy and long-term effects of recommendation. These results can
provide insights and motivations for the design of a next generation of recommender
systems.

2. Methods 

Data. The data used for the empirical study in Fig. 1 and Fig. S1 are described in
the Supplementary Information. In the following simulation, we use two benchmark 
data sets: Movielens (an online movie rating and recommendation service) and Netflix 
(a DVD rental service). The Movielens data (available at www.grouplens.org) contains 
1682 movies and 943 users who have rated movies using the integer scale ranging from 1 
(worst) to 5 (best). To obtain an unweighted bipartite network, we represent any rating 
of 3 or more as a link between the respective user and item. The resulting network 
contains 82520 links. The average degree of users and items is 88 and 49, respectively. 
The maximum degree of users and items is 509 and 558, respectively. The Netflix data 
is a subset of the original data set released for the purpose of the Netflix Prize (available
at www.netflixprize.com [21]). The subset contains 1014 users and 2049 movies chosen
from the original data at random and all links among them (since the input data uses 
the same rating scale as Movielens, the threshold rating of 3 is again used to determine 
whether a link is present or not). The resulting network contains 54093 links. The 
average degree of users and items is 53 and 26, respectively. The maximum degree of 
users and items is 409 and 604, respectively. 
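
The thresholding step described above can be illustrated with a minimal Python sketch. The function name `build_bipartite` and the toy rating triples are illustrative assumptions and are not taken from the original study; only the rule "a rating of 3 or more becomes a link" comes from the text.

```python
from collections import defaultdict

def build_bipartite(ratings, threshold=3):
    """Turn (user, item, rating) triples into an unweighted bipartite network.

    A link is kept for every rating >= threshold, as described in the text.
    Returns (user_items, item_degree).
    """
    user_items = defaultdict(set)
    for user, item, rating in ratings:
        if rating >= threshold:
            user_items[user].add(item)
    # Item degree = number of users linked to the item.
    item_degree = defaultdict(int)
    for items in user_items.values():
        for item in items:
            item_degree[item] += 1
    return dict(user_items), dict(item_degree)

# Toy data (hypothetical): the rating of 2 is dropped, the others become links.
toy_ratings = [("u1", "a", 5), ("u1", "b", 2), ("u2", "a", 3), ("u2", "c", 4)]
user_items, item_degree = build_bipartite(toy_ratings)
```

Average user and item degrees then follow as the number of links divided by the number of users and items, respectively, which is how the values quoted above can be checked.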

Figure 1. (a) A simple illustration of the long tail phenomenon, and (b-d) its presence
in real data produced by various e-commerce systems. The normalized item popularity
is used to remove the size effect [22]. The same plots in log-log scale are shown in the
respective insets. Results from different time periods show that popular items become
more popular compared to the niche items in the tail, contrary to the belief that
recommendation algorithms produce a thicker tail.

Models and metrics. To model the co-evolution of user choices and the
recommendations generated by the recommender system, we construct a model of a
recommendation ecosystem as follows. The real data described above are used as the
initial configuration. The network evolves through a so-called rewiring process where
each link is assigned a time stamp (the initial time stamps are chosen at random). In
each rewiring step, the oldest link of every user is redirected to a new item and assigned
the current time (i.e., it becomes the user’s newest link). Deleting the oldest links
simulates the case where recommendations are generated from the recent
historical record [23]. We assume that when a new target item for a currently rewired
link is being chosen, the user follows recommendation with probability p and decides
independently of recommendation with the complementary probability 1 − p. We use the
widespread Item-based Collaborative Filtering (ICF) as the recommendation algorithm
(see details below); variants of this algorithm are employed by Amazon and other major
web sites [5]. When a user follows recommendation, they select an item from their
current recommendation list with probability inversely proportional to the item’s rank in
the list (the motivation to use rank-reciprocal rather than equal probability for all listed
items comes from [24, 25]). When a user acts independently of recommendation, they
either choose an item at random (which we refer to as Random Attachment, RA) or they
choose proportionally to item degree increased by one (which we refer to as Preferential
Attachment, PA; item degree is incremented by one to make it possible for items which
lose all their links to be chosen again). As the network evolves, the degree of each user
is preserved. On the other hand, the network structure and item degree values change
significantly through rewiring. At any moment, the network structure can be represented
by a so-called adjacency matrix A whose element $a_{i\alpha}$ is one when user i is currently
connected with item α and zero otherwise.
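
One rewiring step of this model might be sketched as follows. This is a minimal illustration under several assumptions that the text does not fix: users are processed one after another within a step, the new item is drawn only from items the user does not currently hold, and a user falls back to the independent choice when the recommendation list is empty. The names `rewiring_step`, `rank_reciprocal_choice`, `preferential_choice` and the callable `recommend` are hypothetical.

```python
import random

def rank_reciprocal_choice(ranked_items, rng):
    """Pick one item with probability proportional to 1/rank (rank 1 is the top)."""
    weights = [1.0 / rank for rank in range(1, len(ranked_items) + 1)]
    return rng.choices(ranked_items, weights=weights, k=1)[0]

def preferential_choice(candidates, item_degree, rng):
    """Pick one item with probability proportional to (degree + 1), i.e. PA."""
    weights = [item_degree.get(item, 0) + 1 for item in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def rewiring_step(links, items, recommend, p, now, rng):
    """Rewire the oldest link of every user, in place.

    links: dict user -> dict item -> time stamp of that link.
    recommend: hypothetical callable (user, links) -> ranked list of items.
    """
    item_degree = {}
    for held in links.values():
        for item in held:
            item_degree[item] = item_degree.get(item, 0) + 1
    for user, held in links.items():
        candidates = [item for item in items if item not in held]
        if not held or not candidates:
            continue  # nothing to rewire, or no free item to rewire to
        oldest = min(held, key=held.get)
        ranked = []
        if rng.random() < p:  # the user follows recommendation
            ranked = [item for item in recommend(user, links) if item not in held]
        if ranked:
            new_item = rank_reciprocal_choice(ranked, rng)
        else:  # independent choice (PA variant); also the assumed fallback
            new_item = preferential_choice(candidates, item_degree, rng)
        del held[oldest]          # user degree is preserved: one link out ...
        held[new_item] = now      # ... and one link in, stamped with the current time
        item_degree[oldest] -= 1
        item_degree[new_item] = item_degree.get(new_item, 0) + 1

# Toy run with a fixed, hypothetical recommender that always returns ["c", "d"].
rng = random.Random(0)
links = {"u1": {"a": 0, "b": 1}, "u2": {"a": 0}}
rewiring_step(links, ["a", "b", "c", "d"], lambda user, current: ["c", "d"],
              p=1.0, now=2, rng=rng)
```

Replacing `preferential_choice` by a uniform draw from `candidates` would give the RA variant. Iterating the step while increasing `now` gives the long-term evolution studied in the paper.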

We use the Gini coefficient G to measure inequality of the item popularity 
distribution during the network evolution. While this quantity has been originally 
proposed to quantify inequality of income or wealth distribution [26], it has also been 
used in other fields [27, 28, 29]. The Gini coefficient can be computed as

$$G = \frac{2\sum_{\alpha=1}^{M} \alpha\, k_\alpha}{M \sum_{\alpha=1}^{M} k_\alpha} - \frac{M+1}{M}, \tag{1}$$

where $k_\alpha$, the popularity of objects, has been sorted in ascending order. The two
extreme values of the Gini coefficient are 0 and 1 which correspond to equal popularity 
of all items and zero popularity of all items but one, respectively. An increase of the 
Gini coefficient thus corresponds to the item popularity distribution becoming more 
unequal. We also used other measures, such as the link share of 1% most popular items 
and the Herfindahl index, to study the rewiring model. Results obtained with different 
inequality metrics are consistent with those obtained with the Gini coefficient. 
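
As a quick check of the Gini formula, the following minimal Python sketch evaluates it on two toy degree sequences. The function name `gini` and the toy inputs are illustrative assumptions. Note that for a finite number of items M, the "all links on one item" case gives (M − 1)/M, which approaches 1 only as M grows.

```python
def gini(degrees):
    """Gini coefficient of item popularity, with degrees sorted in ascending order."""
    k = sorted(degrees)
    m = len(k)
    total = sum(k)
    if m == 0 or total == 0:
        raise ValueError("need at least one item and one link")
    # Rank-weighted sum: sum over alpha of alpha * k_alpha, with alpha starting at 1.
    weighted = sum(rank * value for rank, value in enumerate(k, start=1))
    return 2.0 * weighted / (m * total) - (m + 1.0) / m

equal_case = gini([5, 5, 5, 5])          # all items equally popular
concentrated_case = gini([0, 0, 0, 10])  # one item holds every link
```

With four items, `equal_case` evaluates to 0 and `concentrated_case` to 0.75, i.e. (M − 1)/M for M = 4.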

Recommendation. We use the widespread Item-based Collaborative Filtering
(ICF) as the main recommendation algorithm [30]. In ICF, the recommendation score
of an item is computed based on the item’s similarity with other items collected by a
target user. The score of item α for user i reads

$$f_{i\alpha} = \sum_{\beta=1}^{M} s_{\alpha\beta}\, a_{i\beta}, \tag{2}$$

where $s_{\alpha\beta}$ is the similarity of items α and β, and $a_{i\beta}$ are elements of the network’s
adjacency matrix. We choose the following formula for item similarity

$$s_{\alpha\beta} = \frac{|\Gamma_\alpha \cap \Gamma_\beta|}{(k_\alpha k_\beta)^{\theta}}, \tag{3}$$

where $\Gamma_\alpha$ denotes the set of users who have collected item α and $k_\alpha$ denotes the degree of
item α. Parameter θ can be continuously adjusted in the range [0,1] which includes three
classical cases: the common neighbor similarity which simply counts the number of users
who have collected both items (when θ = 0), the cosine similarity which is sometimes
referred to as the Salton index (when θ = 1/2) and the Leicht-Holme-Newman similarity
(when θ = 1) [31]. Due to a lack of normalization, high-degree items are favored when
θ = 0. By contrast, low-degree items are favored when θ > 1/2. Equation (3) thus gives
us the opportunity to gradually move from recommendations biased towards high-degree
items (when θ = 0) to diversity-favoring recommendations biased towards low-degree
items (when θ = 1). To obtain a recommendation list for a given user, all items that are
currently not connected with this user are sorted according to their recommendation
score in descending order and finally the top L items are kept on the recommendation
list. We use L = 20 here, which is a common value in previous studies of recommender
systems [13]. Considering only the top L items is motivated by the fact that users in 
real online systems do not have time to inspect the list of all items ranked by their 
recommendation score (as further reflected in the rank-reciprocal probability in our link 
rewiring process). 
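
The scoring and ranking procedure described in this section can be sketched in a few lines of Python. This is only an illustration: the similarity is taken to be the number of common users divided by the product of the two item degrees raised to the tunable exponent (called `theta` in the code), and the function names, the tie-breaking rule and the toy network are assumptions.

```python
def item_similarity(users_a, users_b, theta):
    """Common users of two items, normalized by (k_a * k_b) ** theta."""
    common = len(users_a & users_b)
    if common == 0:
        return 0.0
    return common / (len(users_a) * len(users_b)) ** theta

def recommend(user, user_items, theta, L=20):
    """Return the top-L items not yet collected by `user`, ranked by ICF score."""
    item_users = {}
    for u, items in user_items.items():
        for item in items:
            item_users.setdefault(item, set()).add(u)
    held = user_items[user]
    scores = {}
    for item, users in item_users.items():
        if item in held:
            continue
        # Score = sum of similarities with the items the user already holds.
        scores[item] = sum(item_similarity(users, item_users[b], theta) for b in held)
    # Ties are broken alphabetically; this tie-breaking rule is an assumption.
    ranked = sorted(scores, key=lambda item: (-scores[item], str(item)))
    return ranked[:L]

# Toy network (hypothetical): item "b" is popular, item "c" is a niche item.
toy = {"u1": {"a"}, "u2": {"a", "b"}, "u3": {"a", "b"}, "u4": {"a", "c"}, "u5": {"b"}}
cn_list = recommend("u1", toy, theta=0.0)   # common neighbors
lhn_list = recommend("u1", toy, theta=1.0)  # Leicht-Holme-Newman
```

In this toy case the common-neighbor variant ranks the popular item "b" first, while the Leicht-Holme-Newman variant ranks the niche item "c" first. An item that has lost all its links never enters `item_users` and therefore cannot be recommended.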

3. Results 

Distribution of item popularity. After a sufficient number of iteration steps in 
our model, the system reaches a stable state and the item degree distribution becomes 
stationary. We begin our analysis by comparing the distribution of item degree in the 
original data with the stationary outcome of the rewiring procedure. We assume here for 
simplicity that the users choose new items solely by recommendation (i.e., p = 1). Our
implementation of item-based collaborative filtering employs an item similarity metric
with one parameter, given by Eq. (3). Three particular values of this parameter
lead to well-known similarity measures: θ = 0, 1/2, 1 correspond to Common Neighbor
similarity (CN), Cosine similarity (COS), and Leicht-Holme-Newman similarity (LHN),
respectively. The present parameterization allows us to continuously tune between
recommendation that favors high-degree items (when θ is small) and low-degree items
(when θ is large). This is well demonstrated by Fig. 2, which shows the item degrees
in the original data compared to those after the rewiring procedure using CN and
LHN. ICF with CN (CN-ICF) further increases the popularity of the most popular items
and essentially eliminates the long tail. This adverse outcome produced by an otherwise
well accepted and popular recommendation method demonstrates the potential danger
of recommendation for information diversity, similar to the loss of biodiversity in an
ecosystem. On the other hand, LHN-ICF strengthens the long tail but, as we shall see
later, its recommendation accuracy is low.

To obtain a quantitative comparison, we measure the Gini coefficient over the item 
degrees. The higher the values, the more uneven the distributions, and the greater 
the loss of information diversity in the system. The Gini coefficients corresponding to
the distributions shown in Fig. 2 are 0.31, 0.67, 0.88 (Movielens) and 0.28, 0.82, 0.95
(Netflix) for LHN-ICF, the original data, and CN-ICF, respectively. The LHN-ICF
method leads to a remarkable equalization of the item popularity, while the CN-ICF
method further increases the degree heterogeneity. Here we use L = 20; results for
other values of L are shown in Fig. S2. Moreover, Fig. S3 covers the
case where users are influenced by a similarity constraint when selecting items from
the recommendation list, which is intended to prevent all information in the original
data from being washed away by rewiring.

Figure 2. The normalized item popularity versus the normalized item rank in (a)
Movielens and (b) Netflix. The same plots in log-log scale are shown in the respective
insets.

User reliance on recommendation. In practice, users do not always follow
recommendations. We thus consider the case with p < 1, i.e., users follow
recommendation with probability p and otherwise (with probability 1 − p) choose
an item according to preferential attachment (as shown in Figs. S4 and S5, the
resulting behavior is similar if preferential attachment is replaced by random or real
choice of items). The value of p thus characterizes users’ reliance on recommendation,
and parameter θ controls the popularity bias of recommendations. While users’
recommendation lists are populated mostly with popular items when θ is small, the
number of niche (less popular) items increases when θ is large. We repeatedly rewire
the input data and again quantify the diversity of the system by the stationary value of
the Gini coefficient, G*. As shown in Fig. 3a-b for various values of p, G* is not sensitive
to θ over a broad range from 0 up to approximately 0.6. Once θ exceeds this range, G*
decreases quickly and eventually reaches values lower than those produced by
preferential attachment only. Another important and unexpected finding is that
recommendation can hurt information diversity even more than preferential attachment
(i.e., ICF can lead to higher G* than PA). This is because PA is biased towards popular
items but is not personalized, so any item may be chosen; recommendation algorithms are
personalized and biased towards popular items, and only the top L items identified by
the algorithm are recommended. If the top L items for different users significantly
overlap, they attract a large number of links. This strong tendency of recommendation
to decrease information diversity is a joint outcome of a popularity-favoring
recommendation method and the fact that users are presented with a short list of items,
which further contributes to the “winner takes it all” situation. We also consider a
different recommendation algorithm [15], which yields results remarkably similar to
those for ICF presented in this paper (see Fig. S6).

The impact of data density. The high density of these two data sets enables us 
to adjust the density of the network by removing some links. In Fig. 3c-d, we study the
effect of data density on the resulting popularity inequality when p = 1 (see Fig. S7 for
the results with p < 1). We randomly remove links from the original network until the
desired density is reached. Note that the last remaining link of a user is never removed in
this process. The effect of data density is particularly strong when recommendations are
made with the diversity-favoring LHN-ICF method, which leads to low G* when the data
density is high but does the opposite when the data density is low. This can be explained

Modeling mutual feedback between users and recommender systems 


by the presence of objects with only a few links in low-density data: once those links are
rewired to other objects, the respective object can no longer be recommended and
effectively disappears from the system, thus contributing to an increased popularity
inequality and a high G*.
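
The link-removal procedure used to lower the data density can be sketched as follows. This is a minimal illustration with assumed names (`thin_links`, `target_links`); the target is expressed as a number of links, which for a fixed set of users and items is proportional to the density under the usual definition (links divided by the number of user-item pairs), itself an assumption here. The rule that a user's last link is never removed comes from the text.

```python
import random

def thin_links(user_items, target_links, seed=0):
    """Randomly remove links until `target_links` remain, never removing a user's last link."""
    rng = random.Random(seed)
    thinned = {user: set(items) for user, items in user_items.items()}
    all_links = [(user, item) for user, items in thinned.items() for item in items]
    rng.shuffle(all_links)
    remaining = len(all_links)
    for user, item in all_links:
        if remaining <= target_links:
            break
        if len(thinned[user]) > 1:  # keep the last link of every user
            thinned[user].remove(item)
            remaining -= 1
    return thinned

# Toy example (hypothetical): only u1 has links that may be removed.
toy_network = {"u1": {"a", "b", "c"}, "u2": {"a"}}
thinned = thin_links(toy_network, target_links=2)
```

If the target cannot be reached without removing some user's last link, the sketch simply stops early and returns a network with more links than requested.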

Nevertheless, a decrease in the data density does not always increase the Gini
coefficient. This can be shown by two different simulations. In the first case, users are
randomly removed from the
system and G* is found to increase with decreasing data density (see Fig. S8(c)). In the
second case, items are removed at random and, by contrast, G* decreases with decreasing
data density (see Fig. S8(d)). While these two scenarios look similar, the average item
degree decreases in the first case (see Fig. S8(a)) and is preserved in the second case (see
Fig. S8(b)). In other words, even though data density decreases, recommendation
algorithms work equally effectively to distribute the popularity provided that the average
item degree is preserved. A possible explanation is that the nature of an item is
reflected by the characteristics of the individual users who have collected it, so that the
degree of an item represents the amount of available information on it. As a result, when
the average item degree is preserved, the recommendation algorithm works effectively since
there is sufficient information on the items. These results show that the degree of an
item is not merely related to its role in the network, but may also represent the
amount of information we possess on it. The effectiveness of recommendation algorithms
is thus strongly dependent on the average item degree.

Hysteresis. The task of choosing the recommendation parameter θ is made more
important by a pronounced hysteresis phenomenon. Fig. 4 shows that while a state with
low inequality achieved with θ = 1 can be fully reverted to a high inequality state by
changing θ to 0, the opposite is not true. In other words, once high concentration of item
popularity has set in, it can only be partially corrected by the use of a diversity-favoring
recommendation method. This is also related to the amount of information available on
the objects: after we lose information on some objects in a recommendation algorithm,
we may not retrieve it again. Effective recommendation cannot be made for these items
and the popularity remains unevenly distributed, even with diversity-favoring methods.
In Fig. S7, we show that the hysteresis phenomenon exists also when p < 1. We
thus conclude that G* depends on the system’s initial condition, especially when it
corresponds to highly heterogeneous item popularity.

Figure 3. (a,b) The effect of the ICF parameter θ on the stationary Gini
coefficient; links which are not drawn based on recommendation are drawn based
on preferential attachment. (c,d) The effect of data density on the stationary Gini
coefficient for different recommendation methods (here all links are drawn based on
recommendation). The curve labeled original is the Gini coefficient of the network
after density modification and before rewiring.

Trade-off between diversity and accuracy. The previous figures show that
the use of high θ can limit or even reverse the potential popularity-concentrating
effect of recommendation. However, high θ generally leads to low recommendation
accuracy because it tends to recommend objects of low degree [15]. This motivates us
to investigate the trade-off between the possibly decreased stationary Gini coefficient G*
and recommendation precision P measured on the original input data before rewiring.
To measure recommendation accuracy, we apply the standard evaluation procedure
which is based on randomly dividing the network data into two parts: a training set
comprising 90% of all links and a probe set comprising the rest. The training data
$E^T$ is then used to compute recommendation lists for all users. The standard accuracy
metric called precision is based on comparing these recommendation lists with the probe
data $E^P$; a good recommendation algorithm is expected to be able to reproduce a large
part of $E^P$ based on $E^T$ [16]. If, for user i, there are $d_i(L)$ probe entries related to this
user in i’s recommendation list, we say that the recommendation precision for this user is
$P_i(L) := d_i(L)/L$. By averaging this quantity over all users with at least one entry in the
probe set $E^P$, we obtain the overall recommendation precision P(L). To further remove
its possible dependence on the data division, we average precision over ten independent
training set-probe set divisions. Based on the training-probe set division, one can also
measure the short-term recommendation diversity, which is simply the average degree of
items that appear in the recommendation lists. Fig. S9 shows the relation between the
long-term Gini coefficient and short-term recommendation diversity.
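
The evaluation procedure can be illustrated with a minimal Python sketch: a random 90/10 division of the links followed by the precision computation, in which each user's number of probe items found in the top-L list is divided by L and averaged over users with at least one probe entry. The function names `split_links` and `precision`, the seed handling and the toy lists are assumptions made for illustration.

```python
import random

def split_links(links, probe_fraction=0.1, seed=0):
    """Randomly divide (user, item) links into a training set and a probe set."""
    rng = random.Random(seed)
    shuffled = list(links)
    rng.shuffle(shuffled)
    n_probe = round(probe_fraction * len(shuffled))
    return shuffled[n_probe:], shuffled[:n_probe]

def precision(rec_lists, probe_items, L=20):
    """Average over users of (probe items found in the top-L list) / L.

    Only users with at least one probe entry are included in the average.
    """
    per_user = []
    for user, probe in probe_items.items():
        if not probe:
            continue
        hits = sum(1 for item in rec_lists.get(user, [])[:L] if item in probe)
        per_user.append(hits / L)
    if not per_user:
        raise ValueError("no user has a probe entry")
    return sum(per_user) / len(per_user)

# Toy division: 100 hypothetical links are split 90/10.
all_links = [(user, item) for user in range(10) for item in range(10)]
training, probe = split_links(all_links)

# Toy precision: u1 has 2 hits out of L = 4, u2 has none, u3 has no probe entry.
toy_lists = {"u1": ["a", "b", "c", "d"], "u2": ["b", "c", "a", "d"]}
toy_probe = {"u1": {"a", "c"}, "u2": {"x"}, "u3": set()}
toy_precision = precision(toy_lists, toy_probe, L=4)
```

Repeating the division with different seeds and averaging the resulting precision corresponds to the averaging over ten independent divisions mentioned above.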

Figure 4. Evolution of the Gini coefficient starting from different initial
configurations, low-G* achieved with θ = 1 and high-G* achieved with θ = 0, in
(a) Movielens and (b) Netflix data. Panels (c) and (d) show the effect of θ on the
stationary Gini coefficient (ICF method) under different initial configurations.

Fig. 5a simultaneously plots G* and P as a function of θ in the Movielens data.
One can see that the highest precision is achieved at θ ≈ 0.6, which is a point where
the stationary Gini coefficient decreases quickly with θ. This gives us the possibility to
lose some precision by increasing θ from its optimal value and in turn achieve a substantial
decrease of G*. This is demonstrated by Fig. 5c where the desired value of G* is plotted
on the horizontal axis: the resulting precision first decreases rather slowly from its
highest value as the desired stationary Gini coefficient is lowered. Only when the desired
G* is lower than the Gini value in the original data does precision decrease sharply. The
situation is less favorable for the Netflix data where precision is maximized at θ ≈ 0.57,
which lies in the region where G* changes rather slowly with θ, as shown in Fig. 5b. As a
result, one has to substantially increase θ in order to achieve a significant decrease of G*.
This manifests itself in Fig. 5d, which lacks the gentle slope region seen in Fig. 5c.
Nevertheless, one can limit G* to the values seen in the original Movielens and Netflix
data by sacrificing 17% and 12% of the optimal recommendation precision, respectively.
These results show that recommender systems can strike a compromise between
recommendation accuracy and long-term impact on diversity.

4. Discussion 

By studying the co-evolution of the recommendations generated from the recommender 
systems and users’ choices, we demonstrate the long-term effect of recommendation 
on the distribution of item popularity. This novel approach to the evaluation of 
recommendation performance gives us the possibility to observe new phenomena. 
Contrary to the common belief that recommendation helps to match niche items with 
users who may appreciate them and thus contributes to improving their recognition, we
show that typical recommendation methods reinforce the position of already popular
items at the cost of niche items. This is particularly true when recommendation
algorithms optimized for their accuracy are used because these tend to favor popular
items. Our observations suggest that recommendation may divert the system to a state
where a few items enjoy extraordinarily high levels of user attention. Furthermore, we
found a strong hysteresis effect which implies that this state is very robust and can
hardly be changed back to a state with a more even popularity distribution, even with
the help of a diversity-favoring recommendation algorithm of lesser accuracy. All these
are adverse effects if one considers the interaction between users and the recommender
system as an information ecosystem. We remark that although it is not ideal for buyers
to receive recommendations of common popular items, such systems may still benefit
the sellers by lowering the effective number of distinct items that they need to keep in
their inventory and thus reducing the costs of logistics.

Figure 5. The dependence of the stationary Gini values and recommendation precision
on θ in (a) Movielens and (b) Netflix data. Panels (c) Movielens and (d) Netflix show the
relation between the desired stationary Gini coefficient (horizontal axis) and the
short-term recommendation precision.

To the best of our knowledge, our model is the first one to analyze the long-term
influence of recommendation on the evolution of online systems. We tested many
variants of our model, including different original data sets,
another recommendation method, combining recommendation with random attachment
instead of preferential attachment, preventing the real information from being washed
out by repeated rewiring, and different lengths of the recommendation list. Our
finding of the adverse long-run effect of a sub-optimal recommendation system on
information ecosystems in general remains valid in these cases and thus needs to be
seriously considered in practice. Our work raises a number of questions whose answers
would further strengthen our understanding of this long-term influence. For example,
the model now assumes that all users accept the recommendation with the same
probability. One could instead consider a scenario where experienced users
search for items on their own and thus depend less on recommendation. Furthermore, we
measure recommendation accuracy before the rewiring process. It would be interesting to
monitor the recommendation accuracy during the network’s evolution, and couple it with
the rate at which users accept recommendations (the higher the accuracy, the higher the
probability that users follow recommendation). These more complicated scenarios may
make the results of the model quantitatively different from the current ones. A systematic
study of these variants would be an interesting and important extension of the current
model.

Our findings suggest a need for a next generation of recommender systems which
would take into account both short-term and long-term goals. One might argue that
the prime goal of a commercial system is to increase profit by maximizing
recommendation accuracy and that the long-term goal of enhancing or at least preserving
item diversity is secondary. Our results suggest that this is a short-sighted approach,
as focusing on the short-term performance of recommendation may ultimately lead to a
system where the long tail has been decimated together with its economic potential.
To overcome this problem, item diversity can be enhanced by sacrificing a small
fraction of recommendation’s short-term accuracy in exchange for higher long-term
diversity. A detailed investigation of various approaches to study the long-term effects of
recommendation, as well as of possible trade-offs between short- and long-term performance
of recommendation, is of great interest to both researchers and practitioners for future
research.